Constrained key frame localization and mapping for vision-aided inertial navigation

ABSTRACT

Estimation techniques for vision-aided inertial navigation are described. In one example, a vision-aided inertial navigation system (VINS) comprises an image source to produce image data for a keyframe and one or more non-keyframes along a trajectory, the one or more non-keyframes preceding the keyframe along the trajectory. The VINS comprises an inertial measurement unit (IMU) to produce IMU data indicative of a motion of the VINS along the trajectory for the keyframe and the one or more non-keyframes, and a processing unit comprising an estimator that processes the IMU data and the image data to compute state estimates of the VINS. The estimator computes the state estimates of the VINS for the keyframe by constraining the state estimates based on the IMU data and the image data for the one or more non-keyframes of the VINS without computing state estimates of the VINS for the one or more non-keyframes.

This application is a continuation of U.S. application Ser. No.14/271,971 filed on May 7, 2014 and issued on Mar. 28, 2017 as U.S. Pat.No. 9,607,401, which claims the benefit of U.S. Provisional App. No.61/821,136, filed May 8, 2013. The entire content of each of which isincorporated herein by reference.

TECHNICAL FIELD

This disclosure relates to navigation and, more particularly, tovision-aided inertial navigation.

BACKGROUND

In general, a Vision-aided Inertial Navigation System (VINS) fuses datafrom a camera and an Inertial Measurement Unit (IMU) to track thesix-degrees-of-freedom (d.o.f.) position and orientation (pose) of asensing platform. In this way, the VINS combines complementary sensingcapabilities. For example, an IMU can accurately track dynamic motionsover short time durations, while visual data can be used to estimate thepose displacement (up to scale) between consecutive views. For severalreasons, VINS has gained popularity within the robotics community as amethod to address GPS-denied navigation.

SUMMARY

In general, this disclosure describes various techniques for use withina vision-aided inertial navigation system (VINS). More specifically,constrained keyframe localization and mapping (C-KLAM) techniques aredescribed. In one example, a maximum a posteriori (MAP) estimator-basedkeyframe approach for simultaneous localization and mapping (SLAM) isdescribed. As opposed to many existing keyframe-based SLAM approaches,that discard information from non-keyframes in order to reduce thecomputational complexity, the proposed C-KLAM presents a novel andcomputationally-efficient technique for incorporating at least a portion(e.g., most) of this information, resulting in improved estimationaccuracy.

In one example implementation, an approximate MAP estimator-based SLAMalgorithm, referred to as C-KLAM, is described. In order to reduce thecomputational complexity of a batch MAP-based SLAM, an estimator withina VINS applies C-KLAM to compute state estimates along a trajectory onlyfor the keyframes and key landmarks, observed from these keyframes.However, instead of discarding the measurement information fromnon-keyframes and non-key landmarks, C-KLAM uses most of thisinformation to generate consistent pose constraints between thekeyframes, resulting in substantial information gain. Moreover, theapproximations performed in C-KLAM retain the sparsity of theinformation matrix, and hence the resulting optimization problem can besolved efficiently.

In this way, the C-KLAM techniques project information from thenon-keyframes to the keyframes, using marginalization, while maintainingthe sparse structure of the information matrix, to generate fast andefficient solutions. In one example, the C-KLAM techniques project bothproprioceptive and exteroceptive information from the non-keyframes tothe keyframes, using marginalization, while maintaining the sparsestructure of the associated information matrix, resulting in fast andefficient solutions.

The performance of C-KLAM has been tested in both simulations andexperimentally, using visual and inertial measurements, to demonstratethat it achieves performance comparable to that of thecomputationally-intensive batch MAP-based 3D SLAM that uses allavailable measurement information. The results demonstrated that C-KLAMnot only obtains substantial speed-up, but also achieves estimationaccuracy comparable to that of the batch MAP-based SLAM that uses allavailable measurement information.

In one example, a vision-aided inertial navigation system comprises animage source to produce image data for a first keyframe, one or morenon-keyframes and a second keyframe along a trajectory of thevision-aided inertial navigation system (VINS). The one or morenon-keyframes are located between the first keyframe and second keyframealong the trajectory. The VINS comprises an inertial measurement unit(IMU) to produce IMU data indicative of a motion of the VINS along thetrajectory for the keyframe and the one or more non-keyframes. Aprocessing unit of the VINS comprises an estimator that processes theIMU data and the image data to compute respective state estimates for aposition and orientation of the VINS for the first keyframe and for thesecond keyframe. When computing the state estimates, the estimatorconstrains the state estimates for the second keyframe relative to thestate estimates for the first keyframe based on the IMU data and theimage data from the one or more non-keyframes. In one example, whenconstraining the state estimates, the estimator treats a landmarkobserved within the image data for the first keyframe and the secondkeyframe as different from the same landmark observed within the imagedata for the non-keyframes.

A method for computing state estimates for a vision-aided inertialnavigation system (VINS) comprises receiving image data produced by animage source of the vision-aided inertial navigation system for a firstkeyframe, one or more non-keyframes and a second keyframe along atrajectory of the vision-aided inertial navigation system (VINS), theone or more non-keyframes located between the first keyframe and secondkeyframe along the trajectory, and receiving, from an inertialmeasurement unit (IMU), IMU data indicative of motion of the VINS alongthe trajectory for the keyframe and the one or more non-keyframes. Themethod further comprises processing the IMU data and the image data tocompute state estimates of the VINS for the keyframe by constraining thestate estimates based on the IMU data and the image data of the one ormore non-keyframes of the VINS without computing state estimates for theposition and the orientation of the VINS for the one or morenon-keyframes. In example embodiments, constraining the state estimatescomprises treating a landmark observed within the image data for thefirst keyframe and the second keyframe as different from the samelandmark observed within the image data for the non-keyframes.

The details of one or more embodiments of the invention are set forth inthe accompanying drawings and the description below. Other features,objects, and advantages of the invention will be apparent from thedescription and drawings, and from the claims.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a block diagram illustrating a sensor platform comprising anIMU and a camera.

FIG. 2 is a block diagram illustrating an example current explorationepoch.

FIG. 3 illustrates a structure of the sparse information matrix A_(m)corresponding to a cost function c₂.

FIG. 4 is a logical diagram illustrating an example of the explorationepoch before (left) and after (right) the approximation employed inC-KLAM.

FIG. 5 illustrates a structure of the Hessian matrices, H_(c) ₁ andH_(c) ₂ corresponding to cost functions c₁ and c₂.

FIG. 6 is a logical diagram illustrating a Pictorial depiction of theapproximation carried out by C-KLAM in order to ensure sparsity of theHessian matrix.

FIG. 7 illustrates a structure of the Hessian matrix H _(c) ₂corresponding to cost function c ₂.

FIG. 8 shows the x-y view of the estimated trajectory and landmarkpositions.

FIG. 9 is a flowchart illustrating example operation of a device inaccordance with the techniques described herein.

FIG. 10 shows a detailed example of various devices that may beconfigured to implement some embodiments in accordance with the currentdisclosure.

DETAILED DESCRIPTION

One of the main challenges in designing an estimation algorithm formobile devices (e.g., mobile computing devices, mobile phones, robotsand the like) navigating in large environments over long time periodsusing Vision-aided Inertial Navigation System (VINS) is the inherenthigh computational complexity. For example, the computational complexityof the Minimum Mean Squared Error (MMSE) estimator for SimultaneousLocalization and Mapping (SLAM), i.e., the Extended Kalman filter, oftenused in VINS is O(N²) at each time step, where N is the number oflandmarks in the map. Similarly, for the batch Maximum A Posteriori(MAP) estimator-based SLAM (smoothing and mapping), the worst-casecomputational complexity is O([K+N]³), where K is the number of posesfor the mobile device in the trajectory. While existing batch MAP-basedSLAM approaches such as the √{square root over (SAM)}, g²o, and SPA seekto generate solutions by exploiting the sparsity of the informationmatrix, for large-scale SLAM with frequent loop closures, this costeventually prohibits real-time operation.

This disclosure presents MAP-based Constrained Keyframe Localization andMapping (C-KLAM) techniques, which compute state estimate of the VINSfor keyframes along with the positions of landmarks observed from thesekeyframes.

That is, the C-KLAM techniques provide an approximate batch MAP-basedalgorithm that estimates only keyframes (key robot/device poses) and keylandmarks while also exploiting information (e.g., visual observationsand odometry measurements) available to the non-keyframes. Inparticular, this information is projected onto the keyframes, bygenerating consistent pose (position and orientation) constraintsbetween them.

As used herein, the term keyframes refers to the individual poses of theVINS for which position and orientation of the VINS are to be estimated.In contrast, the term non-keyframes refers to intermediate poses betweenkeyframes and for which complete state estimates of the VINS are notcomputed. In example implementations described herein, information fromnon-keyframes, acquired between keyframes, is not discarded. Instead,this information is projected on to the keyframes, in order to generatetight constraints between the keyframes. For example, information from anon-keyframe may be projected onto a preceding keyframe to computerelative position and orientation constraints between the precedingkeyframe and the non-keyframe.

Aspects of the disclosure may have certain advantages. For example, incontrast to existing keyframe-based estimation methods, the C-KLAMtechniques described herein may utilize all available measurementinformation, both proprioceptive (e.g., IMU) and exteroceptive (e.g.camera), from non-keyframes to generate tight constraints between thekeyframes. This may be achieved by marginalizing the non-keyframes alongwith the landmarks observed from the non-keyframes. As another example,the C-KLAM techniques described herein incorporate information frommarginalized frames and landmarks without destroying the sparsity of theinformation matrix, and hence may be used to generate fast and efficientsolutions.

In addition, a cost marginalization of the C-KLAM techniques describedherein may be cubic only in the number of non-keyframes betweenconsecutive keyframes and linear in the number of landmarks observedexclusively from the non-keyframes.

Further, the keyframes and the associated landmark-map may be maintainedover the entire trajectory of a mobile device/robot. As such, the C-KLAMtechniques described herein may enable efficient loop closures, whichmay be necessary for ensuring accurate and consistent long-termnavigation.

FIG. 1 is a block diagram illustrating a vision-aided inertialnavigation system (VINS) 10 comprising at least one image source 12 andan inertial measurement unit (IMU) 14. VINS 10 may be a standalonedevice or may be integrated within or coupled to a mobile device, suchas a robot, a mobile computing device such as a mobile phone, tablet,laptop computer or the like.

Image source 12 images an environment in which VINS 10 operates so as toproduce image data 14. That is, image source 12 provides image data 14that captures a number of features visible in the environment. Imagesource 12 may be, for example, one or more cameras that capture 2D or 3Dimages, a laser scanner or other optical device that produces a streamof 1D image data, a depth sensor that produces image data indicative ofranges for features within the environment, a stereo vision systemhaving multiple cameras to produce 3D information, and the like. Forexample, image source 12 may produce image data for a first keyframe,one or more non-keyframes and a second keyframe along a trajectory ofVINS 10. The one or more non-keyframes being located between the firstkeyframe and second keyframe along the trajectory. In this way, imagedata 14 provides exteroceptive information as to the externalenvironment in which VINS 10 operates.

IMU 16 produces IMU data 18 indicative of a dynamic motion of VINS 10.IMU 14 may, for example, detect a current rate of acceleration using oneor more accelerometers as VINS 10 is translated, and detect changes inrotational attributes like pitch, roll and yaw using one or moregyroscopes. IMU 14 produces IMU data 18 to specify the detected motion.In this way, IMU data 18 provides proprioceptive information as to theVINS 10 own perception of its movement and orientation within theenvironment.

Estimator 22 of processing unit 20 process image data 14 and IMU data 18to compute state estimates for the degrees of freedom of VINS 10 and,from the state estimates, computes position, orientation, speed,locations of observable features, a localized map, an odometry or otherhigher order derivative information represented by VINS data 24.

In one example, estimator 22 comprises an EKF that estimates the 3D IMUpose and linear velocity together with the time-varying IMU biases and amap of visual features 15. Estimator 22 may, in accordance with thetechniques described herein, apply estimation techniques (referred to asC-KLAN techniques) that computes state estimates by projecting IMU dataand image data from non-keyframes to keyframes. That is, estimator 22computes the state estimates for position and orientation of the VINSfor the keyframes by constraining the state estimates for each of thekeyframes relative to a prior keyframe based on the IMU data 18 and theimage data 14 acquired for one or more preceding non-keyframes betweenthe keyframes by marginalizing the non-keyframes. In this way, thedescribed estimator applies a computationally-efficient technique forincorporating at least a portion of the information from non-keyframes,resulting in improved estimation accuracy.

For example, estimator 22 processes IMU data and the image dataassociated with the non-keyframes to compute one or more position andorientation constraints from a first keyframe to a second keyframe forthe VINS, where the first and second keyframes may be any pair ofkeyframes along the trajectory of the VINS. That is, estimator 22 mayprocess the IMU data 18 and the image data 14 associated with thenon-keyframes to compute constraints to changes in position andorientation from the first keyframe to the second keyframe for the VINS.The estimator may compute each of the constraints as (i) an estimate ofthe motion from the first keyframe to the second keyframe (i.e., howmuch VINS 10 has moved and/or rotated between the two keyframes), and(ii) a covariance (or information matrix) providing an indication of anuncertainty of the estimated motion.

Moreover, in one example implementation, when computing the constraintsbetween keyframes for the state estimates, estimator 22 may treat afeature observed within the image data for one or more of the keyframesas different from that same feature observed within the image data forthe non-keyframes. In other words, for purposes of computing the stateestimates for keyframes, estimator 22 may disregard dependencies foreach of the landmarks with respect to landmarks observed within theimage data for the one or more non-keyframes. In this way, estimator 22may constrain the state estimates for keyframes by marginalizing thenon-keyframes and non-key features from the estimate calculationcomputation.

In another example embodiment, estimator 22 may also, or alternatively,treat the same keyframe as being two or more different keyframes. Inother words for the purpose of computing the estimates of the keyframes,estimator 22 may disregard the fact that a keyframe is the same in eachof the constraints in which it is involved and treat the same keyframeappearing in multiple constraints as being different keyframes. Theseconstraints can correspond to constraints due to observations oflandmarks, or constraints due to motion as measured from the IMU, orconstraints induced by marginalizing non-keyframes and non-key features.

Furthermore, in one example, when computing state estimates, estimator22 may prevent projection of the image data and IMU data from both thekey-frames and the non-keyframes along at least one unobservable degreeof freedom. As one example, a rotation of the sensing system around agravity vector may be undetectable from the input of a camera of thesensing system when feature rotation is coincident with the rotation ofthe sensing system. Similarly, translation of the sensing system may beundetectable when observed features are identically translated. Bypreventing projection of image data 14 and IMU data 18 for bothkeyframes and non-keyframes along at least one unobservable degree offreedom, the techniques may improve consistency and reduce estimationerrors as compared to conventional VINS.

Example details of an estimator 22 for a vision-aided inertialnavigation system (VINS) in which the estimator enforces theunobservable directions of the system, hence preventing spuriousinformation gain and reducing inconsistency, can be found in U.S. patentapplication Ser. No. 14/186,597, entitled “OBSERVABILITY-CONSTRAINEDVISION-AIDED INERTIAL NAVIGATION,” filed Feb. 21, 2014, and U.S.Provisional Patent Application Ser. No. 61/767,701, filed Feb. 21, 2013,the entire content of each being incorporated herein by reference.

Example Algorithm Description—Batch Least Squares Formulation

For purposes of explanation, batch MAP-based estimator for SLAM is firstexplained, where the objective is to compute estimates for VINS 10 fromtime-step 0 up to the current time-step k and estimates for all theobserved landmarks. To facilitate the description of SLAM estimationalgorithms, the specific example scenario depicted in FIG. 2 is used.Note, however, that C-KLAM described herein is a general approach thatcan be used for any number of key and non-key poses and landmarks.

Consider a robot or other mobile device containing VINS 10, equippedwith proprioceptive (e.g., IMU) and exteroceptive (e.g., camera)sensors, navigating in a 3D environment. The state vector can berepresented given by:

x _(0:k) ^(BA) =[X ₀ ^(T) X ₁ ^(T) . . . X _(k) ^(T) f ₁ ^(T) . . . f_(m) ^(T)]^(T)  (1)

where x_(i) denotes the robot pose (position and orientation) attime-step i, i=0, 1, 2, . . . , k, and f_(j) is the position of the j-thlandmark, j=1, 2, . . . , m, with respect to a global frame ofreference.

The motion model for a robot or other mobile device containing VINS 10between time-steps i−1 and i is described by the following genericnonlinear equation:

$\begin{matrix}{g_{i} = {{x_{i} - {f\left( {x_{i - 1},{u_{i - 1_{m}} - w_{i - 1}}} \right)}} = 0}} & (2)\end{matrix}$

where the true control input is

u_(i − 1) = u_(i − 1_(m)) − w_(i − 1), u_(i − 1_(m))

denotes the measured control input, and w_(i−1) is the zero-mean, whiteGaussian measurement noise with covariance Q_(i−1). To employ abatch-MAP estimator, (2) is linearized and the Jacobians are computedwith respect to the state vector (1) and the noise, respectively, i.e.,

$\begin{matrix}{{\Phi_{i - 1}\overset{\Delta}{=}{\left. \frac{\partial g_{i}}{\partial x_{0:k}^{BA}} \right|_{\{{x_{0:k}^{BA},0}\}} = \left\lbrack {0_{{X} \times {X}}\mspace{14mu} \ldots \mspace{11mu} \varphi_{x_{i - 1}\mspace{14mu}}I_{X}\mspace{14mu} \ldots \mspace{14mu} 0_{{X} \times 3}\mspace{14mu} \ldots \mspace{14mu} 0_{{X} \times 3}} \right\rbrack}}\;} & (3) \\{\mspace{79mu} {G_{i - 1}\overset{\Delta}{=}\left. \frac{\partial g_{i}}{\partial W_{i - 1}} \right|_{\{{x_{0:k}^{BA},0}\}}}} & (4)\end{matrix}$

where |x| is the dimension of the state of a single robot pose, ̂x_(0:k)^(BA) denotes the linearization point for the state (1) which is thecurrent best estimate, while a zero vector is used as the linearizationpoint for the noise, and

Φ_(x_(i − 1))

denotes the Jacobian with respect to the robot pose at time step i−1.

In one example, a robot or other mobile device containing VINS 10 isequipped with an exteroceptive image source 12, e.g., a camera, andobserves landmarks (e.g., feature 15 of FIG. 1) in the environment. Themeasurement model for robot pose i observing landmark j is given by thefollowing general nonlinear function:

z _(ij) =h(x _(i) ,f _(j))+v _(ij)  (5)

where z_(ij) denotes the measurement, and v_(ij) is zero-mean Gaussianmeasurement noise with covariance R_(ij). And the measurement Jacobianmatrix is given by:

$\begin{matrix}{H_{ij}\overset{\Delta}{=}{\left. \frac{\partial h}{\partial x_{0:k}^{BA}} \right|_{\{ x_{0:k}^{BA}\}} = \left\lbrack {0_{{z} \times {X}}\mspace{14mu} \ldots \mspace{11mu} H_{{xij}\mspace{20mu}}0_{{z} \times {X}}\mspace{14mu} \ldots \mspace{14mu} H_{f_{ij}}\mspace{14mu} 0_{{z} \times 3}\mspace{14mu} \ldots \mspace{14mu} 0_{{z} \times 3}} \right\rbrack}} & (6)\end{matrix}$

where |z| is the dimension of a single exteroceptive sensor measurement,and H_(x) _(ij) and H_(f) _(ij) are the Jacobians with respect to therobot pose at time-step i and the j-th landmark position, respectively.

In one example implementation, the batch-MAP estimator utilizes all theavailable information to estimate the state vector (1). The informationused may include: (i) the prior information about the initial state,described by a Gaussian pdf with mean ̂x_(0|0) and covariance P_(0|0),(ii) the proprioceptive sensor motion information (5), and (iii) theexteroceptive sensor measurements (2). In this example, the batch-MAPestimator seeks to determine the estimate {circumflex over (x)}_(0:k|k)^(BA) that maximizes the posterior pdf:

$\begin{matrix}{{p\left( x_{0:k}^{BA} \middle| z_{0:k} \right)} \propto {{p\left( x_{0} \right)}{\prod\limits_{i = 1}^{k}\; {{p\left( {\left. x_{i} \middle| x_{i - 1} \right.,u_{i - 1_{m}}} \right)}{\prod\limits_{z_{ij} \in Z_{0:k}}\; {p\left( {\left. z_{ij} \middle| x_{i} \right.,f_{j}} \right)}}}}}} & (7)\end{matrix}$

where Z_(0:k) denotes all the available measurements in the timeinterval [0,k]. For Gaussian and independent measurement noises (see(2), and (2), respectively), this pdf (7) can be written as:

$\begin{matrix}{{p\left( x_{0:k}^{BA} \middle| z_{0:k} \right)} \propto {\frac{1}{\sqrt{\left( {2\pi} \right)^{x}{P_{0|0}}}}{\exp \left( {{- \frac{1}{2}}{{x_{0} - {\hat{x}}_{0|0}}}^{2}P_{00}} \right)} \times {\prod\limits_{i = 1}^{k}\; {\frac{1}{\sqrt{\left( {2\pi} \right)^{x}{Q_{i - 1}^{\prime}}}}{\exp \left( {{- \frac{1}{2}}{{x_{i} - {f\left( {x_{i - 1},u_{i - 1_{m}}} \right)}}}_{Q_{i - 1}^{\prime}}^{2}} \right)} \times {\prod\limits_{z_{ij} \in z_{0:k}}\; {\frac{1}{\sqrt{\left( {2\pi} \right)^{x}{R_{ij}}}}{\exp \left( {{- \frac{1}{2}}{{z_{ij} - {h\left( {x_{i},f_{j}} \right)}}}_{R_{ij}}^{2}} \right)}}}}}}} & (8)\end{matrix}$

In the above expression, the notations, ∥a∥_(M) ²

a^(T)M⁻¹a and Q′_(i−1)

G_(i−1)Q_(i−1)G_(i−1) ^(T) are utilized. By taking logarithm and ignoreconstant terms, the maximization of (8) is equivalent to theminimization of the following cost function:

$\begin{matrix}{{c\left( x_{0:k}^{BA} \right)} = {{\frac{1}{2}{{x_{0} - {\hat{x}}_{0|0}}}_{P_{0|0}}^{2}} + {\sum\limits_{i = 1}^{k}{\frac{1}{2}{{x_{i} - {f\left( {x_{i - 1},u_{i - 1_{m}}} \right)}}}_{Q_{i - 1}^{\prime}}^{2}}} + {\sum\limits_{z_{ij} \in z_{0:k}}{\frac{1}{2}{{z_{ij} - {h\left( {x_{i},f_{j}} \right)}}}_{R_{ij}}^{2}}}}} & (9)\end{matrix}$

c(x_(0:k) ^(BA)) is a nonlinear least squares cost function, and astandard approach to determine its minimum is to employ Gauss-Newtoniterative minimization. Specifically, at the l-th iteration of thismethod, a correction, δx_(0:k) ^(BA(l)), to the current estimate,{circumflex over (x)}_(0:k|k) ^(BA(l)), is computed by minimizing thesecond-order Taylor-series approximation of the cost function which isgiven by:

$\begin{matrix}{{{c\left( {{\hat{x}}_{0:k}^{{BA}^{()}} + {\delta \; x_{0:k}^{{BA}^{()}}}} \right)} \simeq {{c\left( {\hat{x}}_{0:{k|k}}^{{BA}^{()}} \right)} + {b_{b}^{{()}^{T}}\delta \; x_{0:k}^{{BA}^{()}}} + {\frac{1}{2}\delta \; x_{0:k}^{{BA}^{{()}^{T}}}A_{b}^{()}\delta \; x_{0:k}^{{BA}^{()}}}}}\mspace{20mu} {where}} & (10) \\{\mspace{79mu} {b_{b}^{()}\overset{\Delta}{=}\left. {\nabla_{x_{0:k}^{BA}}{c( \cdot )}} \right|_{\{{\hat{x}}_{0:{k|k}}^{{BA}^{()}}\}}}} & (11) \\{\mspace{79mu} {A_{b}^{()}\overset{\Delta}{=}\left. {\nabla_{x_{0:k}^{BA}}^{2}{c( \cdot )}} \right|_{\{{\hat{x}}_{0:{k|k}}^{{BA}^{()}}\}}}} & (12)\end{matrix}$

are the gradient and Hessian of c(•) with respect to x_(0:k) ^(BA),evaluated at the current state estimate {circumflex over (x)}_(0:k|k)^(BA(l)).

The structure of the Jacobian and Hessian matrices which will be used inthe ensuing analysis is now examined. Specifically, at the l-thiteration, b_(b) ^((l)) is (see (3) and (6)):

b _(b) ^((l))=Π^(T) P _(0|0) ⁻¹({circumflex over (X)} _(0|k) ^((l))−{circumflex over (X)} _(0|0))+Σ_(i=1) ^(k)Φ_(i=1) ^((l)) ^(T) Q′ _(i−1)⁻¹({circumflex over (X)} _(i|k) ^((l)) −f({circumflex over (X)} _(i−1|k)^((l)) ,u _(i−1) _(m) ))−Σ_(z) _(ij) _(εz) _(0:k) H _(ij) ^((l)) ^(T) R_(ij) ⁻¹(z _(ij) −h(f _(j|k) ^((l))))  (13)

where Π=[I_(|x|) . . . 0]. On the other hand, the Hessian matrix, A_(b)^((l)), is approximated in the Gauss-Newton method by (see (3) and (6)):

A _(b) ^((l))=Π^(T) P _(0|0) ⁻¹Π+Σ_(i=1) ^(k)Φ_(i−1) ^((l)) ^(T) Q′_(i−1) ⁻¹Φ_(i−1) ^((l))+Σ_(z) _(ij) _(εZ) _(0:k) H _(ij) ^((l)) ^(T) R_(ij) ⁻¹ H _(ij) ^((l))  (14)

which is a good approximation for small-residual problems. Due to thesparse structure of the matrices H_(ij) ^((l)) and Ø_(i) ^((l)) (see (3)and (6)), the matrix A_(b) ^((l)) is also sparse, which can be exploitedto speed-up the solution of the linear system in (15). The value ofδx_(0:k) ^(BA) ^((l)) that minimizes (10) is found by solving thefollowing linear system:

A _(b) ^((l)) δx _(0:k) ^(BA) ^((l)) =−b _(b) ^((l))  (15)

Once δx_(0:k) ^(BA) ^((l)) is found, the new state estimate is updatedas:

̂x _(0:k|k) ^(BA) ^((l+1))

x _(0:k|k) ^(BA) ^((l)) ⊕δx _(0:k) ^(BA) ^((l))   (16)

where ⊕ is the corresponding update rule. Given an initial estimatex_(0:k|k) ^(BA) ^((o)) that resides within the attraction basin of theglobal optimum, the iterative algorithm described above will compute theglobal minimum (i.e., MAP estimate) for the entire state given allmeasurements up to time-step k along the trajectory of VINS 10.

Keyframe Based SLAM

As the mobile device continuously moves and observes new landmarks, thesize of the state vector x_(0:k) ^(BA) in the batch MAP estimatorconstantly increases (typically linearly in time). This may not besuitable for all real-time operations. To reduce the computational costof the batch MAP estimator, in one example implementation, estimator 22of VINS 10 stores a threshold number of keyframes, and the keyframes'poses of VINS 10 along with the positions of landmarks observed fromthese keyframes (referred to as “key landmarks”) are estimated withoutcomputing state estimates of VINS 10 for non-keyframes or computingpositions for landmarks observed only from non-keyframes (referred to as“non-key landmarks”). Information from non-keyframes, however, is notdiscarded. Instead, this information is retained throughmarginalization. A marginalization approach with respect to thenon-keyframes and non-key landmarks, and C-KLAM-based estimationtechniques, are described.

FIG. 2 is a block diagram illustrating the structure of an examplecurrent exploration epoch. In FIG. 2, x_(M) denotes the non-key posesbetween the two sets, x_(K) ₁ , and x_(K) ₂ , of key poses along atrajectory. f_(K), f_(M), and f_(B) denote the landmarks observed onlyfrom the key poses, only from the non-key poses, and both from the keyand non-key poses, respectively. The arrows denote the measurementsbetween different states.

Consider the current exploration epoch shown in FIG. 2. Here, estimator22 keeps in its internal state vector the keyframe poses {x_(K) ₁ ,x_(K)₂ } and key landmarks {f_(K),f_(B)} observed from these keyframes, whilemarginalizing out non-keyframe poses {x_(M)} and non-key landmarks{f_(M)} observed exclusively from the non-keyframe poses. As such, thestate vector (1) may be partitioned into:

x _(0:k) ^(BA) =[x _(R) ^(CK) ^(T) x _(M) ^(CK) ^(T) ]^(T) =x^(CK)  (17)

where x_(R) ^(ck)=X_(K) ₁ ^(t) X_(k) ₂ ^(T) f_(K) ^(T) f_(B) ^(T)]^(T)denote the states that will be retained, while x_(M) ^(CK)=[x_(M) ^(T)f_(M) ^(T)]^(T) are that states to be marginalized.

The following notations are used: let x_(K) ₁₂ denote the last pose inx_(K) ₁ and the first pose in x_(K) ₂ . Let u_(K) ₁ , u_(K) ₂ , u_(M)denote the proprioceptive measurements within poses x_(K) ₁ , x_(K) ₂ ,x_(M), respectively. And let u_(K) ₁ _(,M), u_(M,K) ₂ (dotted arrows 53in FIG. 2) denote the proprioceptive measurements that relate the lastpose in x_(K) ₁ with the first in x_(M), and the last pose in x_(M) withthe first in x_(K) ₂ , respectively. Thus, the batch MAP cost function(9) associated with the current exploration epoch is given by:

$\begin{matrix}{{c\left( x^{CK} \right)} = {{{c_{1}\left( x_{R}^{CK} \right)} + {c_{2}\left( {x_{R}^{CK},x_{M}^{CK}} \right)}} = {{c_{1}\left( {x_{K_{1}},x_{K_{2}},f_{K},{f_{B};z_{K_{1}K}},z_{K_{1}B},z_{K_{2}K},z_{K_{2}B},u_{K_{1}},u_{K_{2}}} \right)} + {c_{2}\left( {x_{M},x_{K_{12}},f_{B},{f_{M};z_{MB}},z_{MM},u_{K_{1}M},u_{M,K_{2}},u_{M}} \right)}}}} & (18)\end{matrix}$

where the cost function has been decomposed into two parts: c₂ is thepart of the cost function corresponding to measurements that involve thenon-key poses x_(M) and landmarks f_(M) (denoted measurements 50, 53 inFIG. 2), while c₁ corresponds to measurements that do not involve x_(M)and f_(M) (depicted in measurements 52 in FIG. 2). Note that here c₂ isa function of x_(R) ₂ ^(CK) and x_(M) ^(CK), with x_(R) ₂ ^(CK)=x_(R) ₂^(CK) and x_(M) ^(CK), with x_(R) ₂ ^(CK)=[X_(K) ₁₂ ^(T) f_(B) ^(T)]^(T)(see (18)). Thus, we have:

$\begin{matrix}{{\min\limits_{x_{R}^{CK},x_{M}^{CK}}\mspace{11mu} {c\left( {x_{R}^{CK},x_{M}^{CK}} \right)}} = {\min\limits_{x_{R}^{CK}}{\left( {\min\limits_{x_{M}^{CK}}\; {c\left( {x_{R}^{CK},x_{M}^{CK}} \right)}} \right){\min\limits_{x_{R}^{CK}}\left( {{c_{1}\left( x_{R}^{CK} \right)} + {\min\limits_{x_{M}^{CK}}\; {c_{2}\left( {c_{R_{2}}^{CK},x_{M}^{CK}} \right)}}} \right)}}}} & (19)\end{matrix}$

By employing the second-order Taylor-series approximation to c₂ andminimizing with respect to x_(M) ^(CK), we can obtain:

$\begin{matrix}{{\min\limits_{x_{M}^{CK}}\mspace{11mu} {c_{2}\left( {x_{R}^{CK},x_{M}^{CK}} \right)}} \simeq {\alpha_{p} + {b_{p}^{T}\begin{bmatrix}{x_{K_{12}} - {\hat{x}}_{12_{k|k}}} \\{f_{B} - {\hat{f}}_{B_{k|k}}}\end{bmatrix}} + {{\frac{1}{2}\begin{bmatrix}{x_{K_{12}} - {\hat{x}}_{12_{k|k}}} \\{f_{B} - {\hat{f}}_{B_{k|k}}}\end{bmatrix}}^{T}{A_{p}\begin{bmatrix}{x_{K_{12}} - {\hat{x}}_{12_{k|k}}} \\{f_{B} - {\hat{f}}_{B_{k|k}}}\end{bmatrix}}}}} & (20)\end{matrix}$

where α_(p) is a constant, and

b _(p) =b _(mr) −A _(rm) A _(mm) ⁻¹ b _(mm)  (21)

A _(p) =A _(rr) −A _(rm) A _(rm) ⁻¹ A _(mr)  (22)

with

$\begin{matrix}{b_{m} = {\left. {\nabla_{\{{x_{R_{2}}^{CK},x_{M}^{CK}}\}}{c_{2}( \cdot )}} \right|_{\{{x_{R_{2}}^{CK},{\hat{x}}_{M}^{CK}}\}} = \begin{bmatrix}b_{mr} \\b_{mm}\end{bmatrix}}} & (23) \\{A_{m} = {\left. {\nabla_{\{{x_{R_{2}}^{CK},x_{M}^{CK}}\}}^{2}{c_{2}( \cdot )}} \right|_{\{{{\hat{x}}_{R_{2_{k|k}}}^{CK},{\hat{x}}_{M_{k|k}}^{CK}}\}} = \begin{bmatrix}A_{rr} & A_{rm} \\A_{mr} & A_{m\; m}\end{bmatrix}}} & (24)\end{matrix}$

being the Jacobian and Hessian matrices corresponding to c₂.

FIG. 3 illustrates a structure of the sparse information matrix A_(m)corresponding to the cost function c₂ (constructed by measurements shownwith arrows 50 in FIG. 2). The shaded blocks denote non-zero elements.The sub-matrices A_(B), corresponding to landmarks observed from bothkey and non-key poses, and A_(f) _(M) , corresponding to landmarksobserved only from the non-key poses, are both block diagonal. Thesub-matrix A_(M), corresponding to non-keyframes, is block tri-diagonal.

According to the problem setup (see FIG. 2), the general structure ofthe Hessian A_(m) corresponding to the cost function c₂, is shown inFIG. 3. Using the block notations in FIG. 3 for A_(m), substitute into(21) and (22) obtains:

$\begin{matrix}{b_{p} = {{b_{mr} - {\begin{bmatrix}{B_{K}D^{- 1}} & {B_{K}D_{2}} \\{B_{B}D^{- 1}} & {B_{B}D_{2}}\end{bmatrix}b_{m\; m}}} = \begin{bmatrix}b_{pK} \\b_{pB}\end{bmatrix}}} & (25) \\{A_{p} = {\begin{bmatrix}{A_{K} - {B_{K}D^{- 1}B_{K}^{T}}} & {{- B_{K}}D^{- 1}B_{B}^{T}} \\{{- B_{B}}D^{- 1}B_{K}^{T}} & {A_{B} - {B_{B}D^{- 1}B_{B}^{T}}}\end{bmatrix} = \begin{bmatrix}A_{pKK} & A_{pKB} \\A_{pKB} & A_{pBB}\end{bmatrix}}} & (26)\end{matrix}$

where

D=A _(M) −A _(Mf) _(M) A _(f) _(M) ⁻¹ A _(f) _(MM)   (27)

D ₂ =−D ⁻¹ A _(Mf) _(M) A _(f) _(M) ⁻¹  (28)

Constrained Keyframe Localization and Mapping (C-KLAM)

The marginalization approach, presented in the previous section,projects the information from non-keyframe poses and associated landmarkobservations onto both keyframe poses x_(K) ₁₂ and key landmarks f_(B).According to (26), the Hessian A_(p) after this marginalization processwould be dense in general, despite the fact that the A_(B) blockcorresponding to the key features f_(B) is originally sparse (blockdiagonal). In some cases, this approach may increase the computationalcomplexity dramatically when the number of key features f_(B) becomeslarge, and hence prove non-useful for real-time solutions. To addressthis problem, a second marginalization step may be used in order toproject the information only onto keyframe poses x_(K) ₁₂ but not thekey landmarks f_(B). Thus, in this example implementation, the C-KLAMtechniques use information from the discarded measurements to generateconstraints only between consecutive key poses, x_(K) ₁₂ , whilemaintaining the sparsity of the information matrix. In what follows, oneexample C-KLAM algorithm is described in detail.

Starting from (20), the following approximation can be introduced:

$\begin{matrix}{{\min\limits_{x_{M}^{CK}}\mspace{11mu} {c_{2}\left( {x_{R}^{CK},x_{M}^{CK}} \right)}} = {{\min\limits_{x_{M}^{CK}}\; {c_{2}\left( {x_{K_{12}},f_{B},x_{M}^{CK}} \right)}} \simeq {\min\limits_{f_{B}}\left( {\min\limits_{x_{M}^{CK}}\; {c_{2}\left( {x_{K_{12}},f_{B},x_{M}^{CK}} \right)}} \right)} \simeq {\min\limits_{f_{B}}\left( {\alpha_{p} + {b_{p}^{T}\begin{bmatrix}{x_{K_{12}} - {\hat{x}}_{{K_{12}}_{k|k}}} \\{f_{B} - {\hat{f}}_{B_{k|k}}}\end{bmatrix}} + {{\frac{1}{2}\begin{bmatrix}{x_{K_{12}} - {\hat{x}}_{12_{k|k}}} \\{f_{B} - {\hat{f}}_{B_{k|k}}}\end{bmatrix}}^{T}{A_{p}\begin{bmatrix}{x_{K_{12}} - {\hat{x}}_{K_{{12k}|k}}} \\{f_{B} - {\hat{f}}_{B_{k|k}}}\end{bmatrix}}}} \right)}}} & (29)\end{matrix}$

This is a quadratic function with respect to f_(B) and closed formsolution can be obtained easily. After solving for f_(B) and substituteback into (29), the c₂ cost term can be approximated by:

$\begin{matrix}{{\min\limits_{f_{B}}\left( {\min\limits_{x_{M}^{CK}}\; {c_{2}\left( {x_{K_{12}},f_{B},x_{M}^{CK}} \right)}} \right)} \simeq {\alpha_{d} + {b_{d}^{T}\left( {x_{K_{12}} - {\hat{x}}_{K_{{12K}|K}}} \right)} + {\frac{1}{2}\left( {x_{K_{12}} - {\hat{x}}_{K_{{12k}|k}}} \right)^{T}{A_{d}\left( {x_{K_{12}} - {\hat{x}}_{K_{{12k}|k}}} \right)}}}} & (30)\end{matrix}$

with α_(d) being some constant and (see (25) and (26))

b _(d) =b _(pK) −A _(pKB) A _(pBB) ⁻¹ b _(pB)  (31)

A _(d) =A _(pKK) −A _(pKB) A _(pBB) ⁻¹ A _(pBK)  (32)

And substitute (25)-(28) into (31) and (32), by employing matrixinversion lemma, the following can be obtained:

b _(d) =b _(pk) +B _(K) D ⁻¹ B _(B) ^(T)(A _(B) ⁻¹ +A _(B) ⁻¹ B _(B)(D−B_(B) ^(T) A _(B) ⁻¹ B _(B))⁻¹ B _(B) ^(T) A _(B) ⁻¹)b _(pB)  (33)

A _(d) =A _(K) −B _(K)(D−B _(B) ^(T) A _(B) ⁻¹ B _(B))⁻¹ B _(K)^(T)  (34)

Note that here both A_(B) and A_(f) _(M) are block diagonal and hencetheir inverses can be calculated with linear cost, O(|f_(B)|) andO(|f_(M)|), respectively. The most computationally-intensive calculationin (33) and (34) is that of (D−B_(B) ^(T)A_(B) ⁻¹B_(B))⁻¹, which iscubic, O(|x_(M)|³), in the number of non-key poses currently beingmarginalized. Since this size is bounded, the marginalization in C-KLAMcan be carried out with minimal computational overhead. Now given themarginalization equations (33) and (34), for C-KLAM, the altogetherminimization problem with the total cost function combing (19) and (30)becomes:

$\begin{matrix}{x_{R}^{CK}{\min \left( {{c_{1}\left( x_{R}^{CK} \right)} + {b_{d}^{T}\left( {x_{K_{12}} - {\hat{x}}_{K_{{12k}|k}}} \right)} + {\frac{1}{2}\left( {x_{K_{12}} - {\hat{x}}_{K_{{12k}|k}}} \right)^{T}{A_{d}\left( {x_{K_{12}} - {\hat{x}}_{K_{{12k}|k}}} \right)}}} \right)}} & (35)\end{matrix}$

This can be solved similarly by Gauss-Newton iterative method as in thecase of batch least squares formulation. Now, the state vector containsonly the keyframe poses {x_(K) ₁ ,x_(K) ₂ } and the key landmarks{f_(K),f_(B)}. Importantly, the Hessian matrix for (35) preserves thedesired sparse structure: block tri-diagonal for the poses part andblock diagonal for the landmarks part. Hence C-KLAM may achievesignificantly more efficient solution than the batch least squares SLAMand the standard marginalized keyframe based SLAM.

FIG. 4 is a logical diagram presenting a second example explorationepoch before (left) and after (right) estimator 22 (FIG. 1) appliesC-KLAM approximation techniques described herein. In this example, x₀,x₄ are the keyframes to be retained, and x₁, x₂, and X₃ are thenon-keyframes to be marginalized. Similarly, f₁, f₅ are key landmarks(observed from the keyframes) to be retained, while f₂, f₃, and f₄ arenon-key landmarks (observed exclusively from the non-keyframes) to bemarginalized. In the left portion of FIG. 4, the arrows denote themeasurements between different states. In the right-hand portion of FIG.4, the arrow between x₀, x₄ represents the pose constraint generatedbetween the keyframes using C-KLAM.

Similar to the example above, the motion model for the mobile device canbe given by:

x _(i+1) =f(x _(i) ,u _(i) −w _(i))  (1A)

where f is a general nonlinear function 1, x_(i) and x_(i+1) denote therobot poses at time-steps i and i+1, respectively, u_(i)=u_(i) _(t)+w_(i), is the measured control input (linear acceleration androtational velocity), where u_(i) _(t) denotes the true control input,and w_(i) is the zero-mean, white Gaussian measurement noise withcovariance Q_(i). The measurement model for the mobile device attime-step i, obtaining an observation, z_(ij), to landmark f_(j) isgiven by:

z _(ij) =h(x _(i) ,f _(j))+V _(ij)  (2A)

where h is a general nonlinear measurement function² and v_(ij) is thezero-mean, white Gaussian measurement noise with covariance R_(ij).

Consider the example exploration epoch shown in FIG. 4, consisting offive robot poses, x_(i), i=0, 1, . . . , 4, and of five point landmarks,f_(j), j=1, 2, . . . , 5, observed from these poses. The batch MAPestimates, ̂x_(0:4) ^(MAP),̂f_(1:5) ^(MAP), of all robot poses, x_(0:4),and all landmark positions, f_(1:5), using all available proprioceptive,u_(0:3), and exteroceptive, Z_(0:4), measurements are given by:

$\begin{matrix}{{\hat{x}}_{0:4}^{MAP},{{\hat{f}}_{1:5}^{MAP}\overset{\Delta}{=}{\arg \mspace{11mu} {\max\limits_{x_{0:4},f_{1:5}}\; {\rho \left( {x_{0:4},{f_{1:5}\text{/}z_{0:4}},u_{0:3}} \right)}}}}} & \left( {3A} \right)\end{matrix}$

where Z_(i) denotes the set of all exteroceptive measurements obtainedat robot pose x_(i), i=0, 1, . . . , 4. Under the Gaussian andindependent noise assumptions, (3A) is equivalent to minimizing thefollowing nonlinear least-squares cost function:

$\begin{matrix}{{\left( {x_{0:4},{f_{1:5};z_{0:4}},u_{0:3}} \right)} = {{{{{{{\frac{1}{2}//{x_{0} - {\hat{x}}_{0|0}}}//_{P_{0|0}}^{2}{+ {\sum\limits_{i = 0}^{3}\frac{1}{2}}}}//{x_{i + 1} - {f\left( {x_{i},u_{i}} \right)}}}//_{Q_{i}^{\prime}}^{2}{+ {\sum_{z_{{ij} \in z_{{0:4},}}}\frac{1}{2}}}}//{z_{ij} - {h\left( {x_{i},f_{j}} \right)}}}//_{R_{ij}}^{2}}\overset{\Delta}{=}{{_{P}\left( {x_{0} - {\hat{x}}_{0|0}} \right)} + {\sum\limits_{i = 0}^{3}{_{M}\left( {x_{i + 1},x_{i},u_{i}} \right)}} + {\sum_{z_{{ij} \in z_{{0:4},}}}{_{O}\left( {x_{i},f_{j},z_{ij}} \right)}}}}} & \left( {4A} \right)\end{matrix}$

where x₀˜

(x_(0|0),P_(0|0)) denotes the prior for the robot pose,Q′_(i)=G_(i)Q_(i)G_(i) ^(T), and G_(i) is the Jacobian of f with respectto the noise w_(i). In what follows, the cost terms arising from theprior, the robot motion, and the landmark observations are denoted byC_(P), C_(M), and C_(O), respectively.

One approach for minimizing (4A) is to employ the Gauss-Newton iterativeminimization algorithm with computational complexity up to O([K+N]³),where K and N denote the number of robot poses and landmarks,respectively. Note that, as the robot explores the environment andobserves new landmarks, the size of the optimization problem (both K andN) in (4A) continuously increases. Therefore, for long trajectories withmany features and frequent loop closures, the cost of solving (4A) mayprohibit real-time operation.

In order to reduce the computational complexity of MAP-based SLAM andensure accurate and real-time navigation over long time durations, inaccordance with the described C-KLAM techniques, estimator 22 (i) buildsa sparse map of the environment consisting of only the key robot posesand the distinctive landmarks observed from these key poses, and (ii)uses measurement information from non-key poses to create constraintsbetween the key poses, in order to improve estimation accuracy.

Specifically, for the example in FIG. 4, estimator 22 retains: (i) x₀and x₄ as key poses, and (ii) landmarks, f₁ and f₅, observed from thesekey poses as key landmarks. In this case, (4A) can be split into twoparts as follows:

$\begin{matrix}{ = {\underset{\underset{_{1}{({x_{0},x_{4},f_{1},{f_{5};{\hat{x}}_{0|0}},z_{01},z_{45}})}}{}}{{_{P}\left( {x_{0};{\hat{x}}_{0|0}} \right)} + {_{O}\left( {x_{0},{f_{1};z_{01}}} \right)} + {_{O}\left( {x_{4},{f_{5};z_{45}}} \right)}} + \underset{\underset{_{2}{({x_{1:3},f_{2:4},x_{0},x_{4},f_{1},{f_{5};},z_{1:3},u_{0:3}})}}{}}{{\sum\limits_{i = 0}^{3}{_{M}\left( {{x_{i + 1}.x_{i}},u_{i}} \right)}} + {\sum_{z_{{ij} \in z_{{1:3},}}}{_{O}\left( {x_{i},f_{j},z_{ij}} \right)}}}}} & \left( {5A} \right)\end{matrix}$

The first part of the cost function, C₁, depends only upon the keyposes, key landmarks, and the measurements between them (denoted by thinarrows in FIG. 4). This part consists of cost terms arising from theprior term and from the two exteroceptive measurements, z₀₁ and z₄₅,obtained at the key poses x₀ and x₄, respectively. The second part ofthe cost function, C₂, contains all cost terms that involve non-keyposes and non-key landmarks. Specifically, these correspond to two typesof cost terms: (i) terms that involve only non-key poses and non-keylandmarks (corresponding to measurements denoted by solid lines in FIG.4), e.g., C_(O)(x₁,f₂;z₁₂), and (ii) terms that involve both key andnon-key elements (corresponding to measurements denoted by dashed linesin FIG. 4), e.g., C_(O)(x_(i),f_(i);z₁₁) and C_(M)(x₁,x₀;u₀).

In this example, only two key poses/landmarks are retained in order tosimplify the explanation. However, estimator 22 may apply the C-KLAMtechniques described herein to retain any number of key poses/landmarks.Moreover, estimator 22 may select the key poses based on certaincriteria, e.g., distance traveled between two key poses, poses thatobserve points of interest, uniqueness of an image or other criteria.Furthermore, for the example in FIG. 4, the depth to the features isassumed to be available (e.g., from an RGB-D camera) in order to reducethe number of measurements and poses required. However, if a regularcamera is used, at least two observations of a key feature and thecorresponding poses may be retained.

FIG. 5 illustrates structures of the Hessian matrices, H_(C) ₁ and H_(C)₂ , corresponding to the cost functions C₁ and C₂ [see (5A)],respectively. The colored blocks denote non-zero elements. Specifically,for H_(C) ₂ , associated with the measurements denoted by arrows in FIG.4, the block-diagonal sub-matrices A_(k) and A_(b) correspond to keyposes and key landmarks, respectively. A_(r) and A_(f) correspond tonon-key poses and non-key landmarks to be marginalized, respectively.Here A_(k) and A_(r) are, in general, block tri-diagonal, while A_(b)and A_(f) are block diagonal.

In general, some keyframe-based approaches optimize only over C₁ inorder to reduce the computational complexity, i.e., the cost terms in C₂and the corresponding measurements are discarded, resulting insignificant information loss. In contrast, example implementations ofthe techniques described herein retain a part of the information in C₂to marginalize the non-key poses and landmarks, x_(1:3) and f_(2:4),respectively. Mathematically, this is equivalent to approximating thecost function C by C′ as follows (see FIG. 5):

≅

′(x ₀ ,x ₄ ,f ₁ ,f ₅ ;{circumflex over (x)} _(0|0) ,z ₀₁ ,z ₄₅,{circumflex over (x)} ₀ ,{circumflex over (x)} ₄ ,{circumflex over (f)}₁ ,{circumflex over (f)} ₅)=

₁+

′₂(x ₀ ,x ₄ ,f ₁ ,f ₅ ;{circumflex over (x)} ₀ ,{circumflex over (x)} ₄,{circumflex over (f)} ₁ ,{circumflex over (f)} ₅)  (6A)

where,

$\begin{matrix}{_{2}^{\prime} = {\alpha^{\prime} + {g_{C_{2}^{\prime}}^{T}\begin{bmatrix}x_{0} & - & {\hat{x}}_{0} \\x_{4} & - & {\hat{x}}_{4} \\f_{1} & - & {\hat{f}}_{1} \\f_{5} & - & {\hat{f}}_{5}\end{bmatrix}} + {{\frac{1}{2}\begin{bmatrix}x_{0} & - & {\hat{x}}_{0} \\x_{4} & - & {\hat{x}}_{4} \\f_{1} & - & {\hat{f}}_{1} \\f_{5} & - & {\hat{f}}_{5}\end{bmatrix}}^{T}{H_{C_{2}^{\prime}}\begin{bmatrix}x_{0} & - & {\hat{x}}_{0} \\x_{4} & - & {\hat{x}}_{4} \\f_{1} & - & {\hat{f}}_{1} \\f_{5} & - & {\hat{f}}_{5}\end{bmatrix}}}}} & \left( {7A} \right)\end{matrix}$

with,

$\begin{matrix}{H_{C_{2}^{\prime}} = {\begin{bmatrix}A_{k} & 0 \\0 & A_{b}\end{bmatrix} - {{\begin{bmatrix}B_{k} & 0 \\B_{b} & 0\end{bmatrix}\begin{bmatrix}A_{r} & A_{rf} \\A_{fr} & A_{f}\end{bmatrix}}^{- 1}\begin{bmatrix}B_{k}^{T} & B_{b}^{T} \\0 & 0\end{bmatrix}}}} & \left( {8A} \right) \\{g_{C_{2}^{\prime}} = {{\begin{bmatrix}g_{k} \\g_{b}\end{bmatrix} - {{\begin{bmatrix}B_{k} & 0 \\B_{b} & 0\end{bmatrix}\begin{bmatrix}A_{r} & A_{rf} \\A_{fr} & A_{f}\end{bmatrix}}^{- 1}\begin{bmatrix}g_{r} \\g_{f}\end{bmatrix}}}\overset{\Delta}{=}{\begin{bmatrix}g_{C_{2}^{\prime},k} \\g_{C_{2}^{\prime},b}\end{bmatrix}.}}} & \left( {9A} \right)\end{matrix}$

Here; ̂x₀,̂x₄, ̂f_(i), and̂f₅ are the estimates of x₀, x₄, f₁, and f₅,respectively, at the time of marginalization, a is a constant termindependent of the optimization variables, and g_(k), g_(b), g_(r), andg_(f) are the gradient vectors of C₂ with respect to{x₀,x₄},{f_(i),f₅},{x_(1:3)}, and {f_(2:4)}, respectively. Also,

and

denote the Jacobian and Hessian matrix, respectively. Lastly,

, as expected, is the Schur complement of the diagonal block,corresponding to non-key c₂′ poses and non-key landmarks, of theHessian, H_(C) ₂ , of the original cost function, C₂ (see FIG. 5).

This marginalization of non-key elements creates additional constraintsbetween the key poses and the key landmarks, which directly translatesinto fill-ins in the reduced Hessian matrix

. This destroys the sparse structure of the Hessian matrix, H_(C′)=H_(C)₁ +

that corresponds to the cost function C′ [see (6A)], and substantiallyincreases the computational cost of obtaining a solution to theminimization problem. Due to the relationship between the measurementgraph corresponding to FIG. 4 and the sparsity pattern of the resultingHessian matrix, the exteroceptive measurements from non-key poses to keyfeatures, i.e., z₁₁ and z₃₅, are the ones responsible for generatingfill-ins in the Hessian matrix, H_(C′), after marginalization. Note thatthe proprioceptive measurements between key and non-key poses, i.e., u₀and u₃, also generate fill-ins, but these fill-ins are desirable as theyrepresent constraints between two consecutive key poses aftermarginalization.

One solution to retain the sparsity of the Hessian matrix would be tofirst discard any exteroceptive measurements between non-key poses andkey features (e.g., z₁₁ and z₃₅ in FIG. 4), and then proceed with themarginalization of non-key elements. However, in real-world scenarios,f₁ and f₅ are not single features, but they each correspond to a groupof features. Hence, such an approximation would discard numerousmeasurements, resulting in substantial information loss.

In order to address this problem and maintain the sparse structure ofthe Hessian (information) matrix while incorporating information fromC₂, one example implementation of the C-KLAM techniques described hereincarries out an additional approximation step, i.e., it furtherapproximates C′₂ in (6A) by a quadratic cost term, C″₂(x₀,x₄; ̂x₀, ̂x₄)that constrains only the key poses x₀ and x₄.

Specifically, along with the non-key poses/landmarks, in this example,estimator 22 marginalizes the key landmarks f₁ and f₅, but only from C₂;these key landmarks will still appear as optimization variables in C₁[see (5A)]. Moreover, marginalizing f₁ and f₅ from C₂, while retainingthem in C₁, implies that estimator 22 ignores their data association andtreat them as different features (say f′₁ and f′₅) in C₂.Mathematically, this process can be described by first considering thefollowing equivalent optimization problems [see (4A), (5A), and FIG. 6].Besides the inability to relinearize marginalized states, ignoring thisdata association is the main information loss incurred by C-KLAM ascompared to the batch MAP-based SLAM.

min C(x _(0:4) ,f _(1:5) ;Z _(0:4) ,u _(0:3))

min C (x _(0:4) ,f _(1:5) ,f′ ₁ ,f′ ₅ ;Z _(0:4) ,u _(0:3))

s.t. f ₁ =f′ ₁ ,f ₅ =f′ ₅  (10A)

where,

$\begin{matrix}{\overset{\_}{C} = {{C_{1}\left( {x_{0},x_{4},f_{1},{f_{5};{\hat{x}}_{\frac{0}{0}}},z_{01},z_{45}} \right)} + {{\overset{\_}{C}}_{2}\left( {x_{1:3},f_{2:4},{x_{0,}x_{4}},f_{1}^{\prime},f_{5}^{\prime},z_{1:3},u_{0:3}} \right)}}} & \left( {11A} \right)\end{matrix}$

Note that minimizing the batch-MAP cost function in (4A) is exactlyequivalent to the constrained optimization problem presented in (10A).Now, in order to maintain the sparsity of the Hessian matrix aftermarginalizing the non-key elements, C-KLAM discards the constraint in(10A) and hence assumes that the features f′₁ and f′₅ are distinct fromf₁ and f₅, respectively (see FIG. 6). Due to this relaxation, C ₂ nolonger depends on the key features f₁ and f₅, and hence has no costterms corresponding to measurements between non-key poses and keyfeatures.

FIG. 7 is a logical diagram illustrating a structure of the Hessianmatrix, H_(C) ₂ , corresponding to the cost function C ₂ [see (A)]. Theshaded blocks denote non-zero elements. Note, that this Hessian matrixdoes not have any entries corresponding to the key features f₁ and f₅.Instead, it has entries for the features f′₁ and f′₅. Due to thisapproximation, C-KLAM can now marginalize the features f′₁ and f′₅,along with the non-key elements x_(1:3) and f_(2:4), from C in (11A),thus ensuring that the resulting Hessian matrix remains sparse.Specifically, C-KLAM approximates C ₂ in (11A) by [see FIGS. 5 and 7]:

$\begin{matrix}{{{\overset{\_}{C}}_{2} \simeq {C_{2}^{''}\left( {x_{0},{x_{4};{\hat{x}}_{0}},{\hat{x}}_{4}} \right)}} = {\alpha^{''} + {g_{C_{2}^{''}}^{T}\begin{bmatrix}x_{0} & - & {\hat{x}}_{0} \\x_{4} & - & {\hat{x}}_{4}\end{bmatrix}} + {\begin{bmatrix}x_{0} & - & {\hat{x}}_{0} \\x_{4} & - & {\hat{x}}_{4}\end{bmatrix}^{T}{H_{C_{2}^{''}}\begin{bmatrix}x_{0} & - & {\hat{x}}_{0} \\x_{4} & - & {\hat{x}}_{4}\end{bmatrix}}}}} & \left( {12A} \right)\end{matrix}$

with,

H _(C″) ₂ A _(k) −B _(k)(D−B _(b) ^(T) A _(b) ⁻¹ B _(b))⁻¹ B _(k)^(T)  (13A)

g _(C″) ₂ =g _(C′) _(2,k) +B _(k) D ⁻¹ B _(k) ^(T)·(A _(b) ⁻¹ +A _(b) ⁻¹B _(b)(D−B _(b) ^(T) A _(b) ⁻¹ B _(b))⁻¹ B _(b) ^(T) A _(b) ⁻¹)g _(C′)_(2,b)   (14A)

and

D=A _(r) −A _(rf) A _(f) ⁻¹ A _(fr.)  (15A)

where α″ is a constant, independent of the optimization variables, and

,

denote the Jacobian and Hessian matrix, respectively.

After this approximation, the final C-KLAM cost function becomes:

C _(CKLAM) =C ₁(x ₀ ,x ₄ ,f ₁ ,f ₅ ;{circumflex over (x)} _(0|0) ,Z ₀₁,Z ₄₅)+C″ ₂(x ₀ ,x ₄ ,{circumflex over (x)} ₀ ,{circumflex over (x)}₄)  (16A)

whose corresponding Hessian would be the same as that of C₁ (and thussparse) plus an additional information (relative pose) constraintbetween x₀ and x₄ due to C″₂. In summary, by approximating C₂ by C″₂,C-KLAM is able to incorporate most of the information from the non-keyposes/landmarks, while maintaining the sparsity of the Hessian matrix.Moreover, the part of the cost function, C₁, corresponding to the keyposes/landmarks, remains intact.

Lastly, the approximation (marginalization) described above can becarried out with cost cubic in the number of marginalized non-key poses,and only linear in the number of marginalized non-key landmarks. For thecomplexity analysis, let us assume that we have M non-key poses andM_(f) non-key features to be marginalized, and M_(b) features that areobserved from both key and non-key frames, where M_(f)>>M_(r) andM_(f)>>M_(b). The marginalization step involves the computation of theHessian matrix,

, and the Jacobian,

, according to (13AError! Reference source not found.)-(15A). Forcomputing both the Hessian and the Jacobian, we first need to calculateD in (15AError! Reference source not found.). Since A_(f) isblock-diagonal, A_(f) ⁻¹ in (15A) can be computed with cost onlyO(M_(f)). Moreover, since the number of marginalized non-key features,M_(f), far exceeds M_(r) and M_(b), the cost of computing D remainsO(M_(f)). To compute the Hessian [see (13A)], note that A_(b) is alsoblock-diagonal, hence obtaining (D−B_(b) ^(T)A_(b) ⁻¹B_(b))⁻¹, which isthe most computationally-intensive operation in (13A), requires O(M_(r)³) operations. The cost of calculating the remaining matrixmultiplications and additions in (13AError! Reference source not found.)is significantly lower as compared to this cubic cost.

To compute the Jacobian,

; [see (14A)], we can reuse the values of D, (D−B_(b) ^(T)A_(b)⁻¹B_(b))⁻¹, and A_(b) ⁻¹, which have already been calculated whencomputing the Hessian. In addition, we need to compute D⁻¹, which can befound with complexity O(M_(r) ³). The rest of the computations involveonly matrix-vector multiplications and vector additions at a negligiblecost.

Hence, the overall cost of the marginalization step is cubic in thenumber of marginalized non-key poses, and only linear in the number ofmarginalized non-key landmarks. Since M_(r) is bounded (user defined),the marginalization in C-KLAM can be carried out with minimalcomputational overhead.

Experimental Results

The experimental setup consists of a PointGrey Chameleon camera and aNavchip IMU, rigidly attached on a light-weight (100 g) platform. TheIMU signals were sampled at a frequency of 100 Hz while camera imageswere acquired at 7.5 Hz. The experiment was conducted in an indoorenvironment where the sensor platform followed a 3D rectangulartrajectory, of total length of 144 m, and returned back to the initialposition in order to provide an estimate of the final position error.

In the C-KLAM implementation, the corresponding approximate batch-MAPoptimization problem was solved every 20 incoming camera frames. Theexploration epoch was set to 60 camera frames, from which the first andlast 10 consecutive camera frames were retained as keyframes, while therest were marginalized using the C-KLAM techniques described herein. Theperformance of C-KLAM was compared to that of thecomputationally-intensive, batch MAP-based SLAM [bundle adjustment(BA)], which optimizes over all camera poses and landmarks, using allavailable measurements, to provide high-accuracy estimates as thecomparison baseline. In the BA implementation, the batch-MAPoptimization problem was solved every 20 incoming camera frames.

FIG. 8 provides a simulated overhead x-y view of the estimated 3Dtrajectory and landmark positions. The C-KLAM estimates only keyframes(marked with squares) and key features (marked with circles), while BAestimates the entire trajectory (marked by the solid line) and allfeatures (marked by X's). FIG. 8 shows the x-y view of the estimatedtrajectory and landmark positions. As evident, the estimates of therobot trajectory and landmark positions generated by the C-KLAMtechniques described herein are almost identical to those of the BAtechnique. Loop closure was performed and the final position error was 7cm for C-KLAM, only 5% more than that of the BA.

In terms of speed, C-KLAM took only 4% of the time required for theentire BA. At the end of this experiment, C-KLAM retained 238 keyframesand 349 key landmarks, while BA had 1038 camera frames and 1281landmarks. This significant reduction in the number of estimated statesin C-KLAM led to substantial improvement in efficiency. Moreover, byusing information from non-keyframes to constrain the keyframes, C-KLAMwas able to achieve estimation performance comparable to that of the BA.

FIG. 9 is a flowchart illustrating example operation of a device inaccordance with the techniques described herein. The device may, forexample, comprise a vision-aided inertial navigation system, mobiledevice, laptop, table, robot, vehicle, server, cloud-based processingsystem or other device having a processor or other operating environmentfor implementing the techniques described herein. For purposes ofexplanation, FIG. 9 will be described with respect to VINS 10 andestimator 22 of FIG. 1.

Initially, estimator 22 receives measurement data (100). That is,estimator 22 receives image data 14 produced by an image source 12 ofthe vision-aided inertial navigation system 10 for at least a first andsecond keyframe and one or more non-keyframes along a trajectory of theVINS. The one or more non-keyframes positioned between the firstkeyframe and the second keyframe along the trajectory, and each keyframeand non-keyframe may correspond to a pose (position and orientation) ofVINS 10 including landmarks observed within the environment at thatpose. In addition, estimator 22 receives, from an inertial measurementunit (IMU) 16, IMU data 18 indicative of motion of VINS 10 along thetrajectory for the keyframes and the one or more non-keyframes. In thisway, VINS 10 receives and records, within VINS data 24, image data 14and IMU data 18 for keyframes and non-keyframes along the trajectory.

Estimator 22 selects frames along the trajectory for which respectivestate estimates are to be computed within a state vector (102). That is,estimator 22 determines which of the frames along the trajectory are tobe treated as key frames for which complete estimates are computed. Inone example, state estimates include complete pose information (positionand orientation) for VINS 10 as well as position information for eachfeature observable at that frame. Estimator 22 may select the keyframesbased on a set of criteria, such as one or more of a distance traveledbetween two consecutive key poses and poses at which points of interestwere detected within the image data.

Based on the selection, estimator 22 maintains a state vector tospecify, for computation, state estimates (variables) for each keyframepose of the VINS and each landmark observed from the keyframe (104).Estimator 22 excludes variables for non-keyframe poses or landmarksobserved only from non-keyframes.

Estimator 22 iteratively processes the state vector to compute stateestimates for each keyframe pose of the VINS and each landmark observedfrom each of the keyframes without computing an state estimates forposes of the VINS at non-key frames or estimated positions for landmarksobserved from only the non-keyframes (106). At this time, estimator 22includes within the computation constraints on the poses associates witheach keyframe from the preceding keyframe, where the pose constrains arebased on the IMU data and the image data associated with thenon-keyframes between the keyframes. In addition, estimator 22 maysimilarly constrain the computed estimated position of each key landmarkbased on the IMU data and the image data from the one or morenon-keyframes. Alternatively, estimator 22 may compute the estimatedpositions for each key landmark specified within the state vector basedonly on the measurements associated with the key frames by disregardingthe IMU data and the image data for the non-keyframes, thereby achievingfurther efficiency.

Estimator 22 may compute each of the constraints as (i) an estimate ofthe motion from the first keyframe to the second keyframe (i.e., howmuch VINS 10 has moved and/or rotated between the two keyframes), and(ii) a covariance (or information matrix) providing an indication of anuncertainty of the estimated motion. Moreover, when constraining thestate estimates between keyframes, estimator 22 treats a featureobserved within the image data for the keyframes as different from thatsame feature observed within the image data for the non-keyframes. Inother words, for purposes of computing the state estimates forkeyframes, estimator 22 may disregard dependencies for each of thelandmarks with respect to landmarks observed within the image data forthe one or more non-keyframes.

Based on the computed state estimates, estimator 22 may construct a map,e.g., a 2D or 3D map, of the environment (108). The map may, forexample, include position and orientation information for the VINS alongthe trajectory relative to position information for any landmarksobserved by the VINS. The map may be displayed, stored, used forsubsequent navigation and the like.

FIG. 10 shows a detailed example of various devices that may beconfigured to implement some embodiments in accordance with the currentdisclosure. For example, device 500 may be a mobile sensing platform, amobile phone, a workstation, a computing center, a cluster of servers orother example embodiments of a computing environment, centrally locatedor distributed, capable of executing the techniques described herein.Any or all of the devices may, for example, implement portions of thetechniques described herein for vision-aided inertial navigation system.

In this example, a computer 500 includes a hardware-based processor 510that is operable to execute program instructions or software, causingthe computer to perform various methods or tasks, such as performing theenhanced estimation techniques described herein. Processor 510 may be ageneral purpose processor, a digital signal processor (DSP), a coreprocessor within an Application Specific Integrated Circuit (ASIC) andthe like. Processor 510 is coupled via bus 520 to a memory 530, which isused to store information such as program instructions and other datawhile the computer is in operation. A storage device 540, such as a harddisk drive, nonvolatile memory, or other non-transient storage devicestores information such as program instructions, data files of themultidimensional data and the reduced data set, and other information.As another example, computer 500 may provide an operating environmentfor execution of one or more virtual machines that, in turn, provide anexecution environment for software for implementing the techniquesdescribed herein.

The computer also includes various input-output elements 550, includingparallel or serial ports, USB, Firewire or IEEE 1394, Ethernet, andother such ports to connect the computer to external device such aprinter, video camera, surveillance equipment or the like. Otherinput-output elements include wireless communication interfaces such asBluetooth, Wi-Fi, and cellular data networks.

The computer itself may be a traditional personal computer, a rack-mountor business computer or server, or any other type of computerizedsystem. The computer in a further example may include fewer than allelements listed above, such as a thin client or mobile device havingonly some of the shown elements. In another example, the computer isdistributed among multiple computer systems, such as a distributedserver that has many computers working together to provide variousfunctions.

The techniques described herein may be implemented in hardware,software, firmware, or any combination thereof. Various featuresdescribed as modules, units or components may be implemented together inan integrated logic device or separately as discrete but interoperablelogic devices or other hardware devices. In some cases, various featuresof electronic circuitry may be implemented as one or more integratedcircuit devices, such as an integrated circuit chip or chipset.

If implemented in hardware, this disclosure may be directed to anapparatus such a processor or an integrated circuit device, such as anintegrated circuit chip or chipset. Alternatively or additionally, ifimplemented in software or firmware, the techniques may be realized atleast in part by a computer readable data storage medium comprisinginstructions that, when executed, cause one or more processors toperform one or more of the methods described above. For example, thecomputer-readable data storage medium or device may store suchinstructions for execution by a processor. Any combination of one ormore computer-readable medium(s) may be utilized.

A computer-readable storage medium (device) may form part of a computerprogram product, which may include packaging materials. Acomputer-readable storage medium (device) may comprise a computer datastorage medium such as random access memory (RAM), read-only memory(ROM), non-volatile random access memory (NVRAM), electrically erasableprogrammable read-only memory (EEPROM), flash memory, magnetic oroptical data storage media, and the like. In general, acomputer-readable storage medium may be any tangible medium that cancontain or store a program for use by or in connection with aninstruction execution system, apparatus, or device. Additional examplesof computer readable medium include computer-readable storage devices,computer-readable memory, and tangible computer-readable medium. In someexamples, an article of manufacture may comprise one or morecomputer-readable storage media.

In some examples, the computer-readable storage media may comprisenon-transitory media. The term “non-transitory” may indicate that thestorage medium is not embodied in a carrier wave or a propagated signal.In certain examples, a non-transitory storage medium may store data thatcan, over time, change (e.g., in RAM or cache).

The code or instructions may be software and/or firmware executed byprocessing circuitry including one or more processors, such as one ormore digital signal processors (DSPs), general purpose microprocessors,application-specific integrated circuits (ASICs), field-programmablegate arrays (FPGAs), or other equivalent integrated or discrete logiccircuitry. Accordingly, the term “processor,” as used herein may referto any of the foregoing structure or any other processing circuitrysuitable for implementation of the techniques described herein. Inaddition, in some aspects, functionality described in this disclosuremay be provided within software modules or hardware modules.

Further example details are illustrated in Appendix I, the contents ofwhich are included herein as part of the specification.

Various embodiments of the invention have been described. These andother embodiments are within the scope of the following claims.

What is claimed is:
 1. A device comprising: an image source configuredto produce image data for a first keyframe, one or more nonkeyframes,and a second keyframe along a trajectory of the device, the one or morenon-keyframes located between the first keyframe and second keyframealong the trajectory; an inertial measurement unit (IMU) configured toproduce IMU data indicative of a motion of the device along thetrajectory; and a processor configured to: process the IMU data and theimage data to compute respective state estimates for a position andorientation of the device for the first keyframe and for the secondkeyframe; process the image data associated with the non-keyframes tocompute one or more constraints to the position and the orientation ofthe device for the second keyframe relative to the position and theorientation of the device for the first keyframe, wherein when computingthe one or more constraints the processor is configured to treat alandmark observed within the image data for the first keyframe or thesecond keyframe as different from the same landmark observed within theimage data for the non-keyframes; and apply the one or more constraintswhen computing the state estimates of the device for the secondkeyframe.
 2. The device of claim 1, wherein, to process the image dataassociated with the non-keyframes to compute the one or more constraintsto the position and the orientation of the device for the secondkeyframe relative to the position and the orientation of the device forthe first keyframe, the processor is configured to process both the IMUdata and the image data associated with the non-keyframes to compute theone or more constraints to the position and the orientation of thedevice for the second keyframe relative to the position and theorientation of the device for the first keyframe.
 3. The device of claim1, wherein, to process the image data associated with the non-keyframesto compute the one or more constraints to the position and theorientation of the device for the second keyframe relative to theposition and the orientation of the device for the first keyframe, theprocessor is configured to: process the image data associated with thenon-keyframes to compute the one or more constraints to the position andthe orientation of the device for the second keyframe relative to theposition and the orientation of the device for the first keyframe; anddiscard IMU data associated with the non-keyframes.
 4. The device ofclaim 1, wherein the processor is further configured to compute each ofthe constraints as an estimate of motion from the first keyframe to thesecond keyframe, and a covariance indicating an uncertainty of theestimated motion.
 5. The device of claim 1, wherein the processor isfurther configured to compute the respective state estimates for thefirst keyframe and the second keyframe without computing state estimatesfor the position and the orientation of the device for each of thenon-keyframes.
 6. The device of claim 1, wherein the processor isfurther configured to compute the state estimates to include arespective estimated position of each landmark observed for the firstkeyframe and the second keyframe, and wherein, to compute the stateestimates for the second keyframe, the processor is further configuredto constrain the position and orientation of the device within thesecond keyframe based on the image data from the one or morenon-keyframes while disregarding dependencies for each of the landmarkswith respect to landmarks observed within the image data for the one ormore non-keyframes.
 7. The device of claim 1, wherein the processor isconfigured to compute estimates for positions of landmarks observed fromthe first keyframe and the second keyframe without computing estimatesfor positions of landmarks observed only from the non-keyframes forwhich a respective pose of the device is not computed.
 8. The device ofclaim 1, wherein the processor is configured to maintain a state vectorthat specifies, for computation, state estimates of the position and theorientation of the device for the first keyframe and one or morelandmarks observed from the first keyframe or the second keyframewithout including variables for non-keyframe poses or landmarks observedonly from non-keyframes, wherein the processor is configured toiteratively process the state vector to compute the state estimates, andwherein, for each iteration, the processor is configured to constrainupdates for the state estimates for the position and orientation of thedevice and the landmarks observed from the keyframes for the devicebased on the image data for the one or more non-keyframes of the device.9. The device of claim 8, wherein the processor is further configuredto: select key frames along the trajectory for which respective stateestimates are to be computed within the state vector, and select the keyframes based on a set of criteria comprising one or more of a distancetraveled between two consecutive key poses and poses at which points ofinterest were detected within the image data.
 10. The device of claim 1,wherein the processor is configured to build a map of the environment toinclude the state estimates of the device.
 11. The device of claim 1,wherein the device is integrated within a mobile phone or a robot.
 12. Asystem comprising: an image source to produce image data for a firstkeyframe, one or more non-keyframes, and a second keyframe along atrajectory of a vision-aided inertial navigation system (VINS), the oneor more non-keyframes located between the first keyframe and secondkeyframe along the trajectory; an inertial measurement unit (IMU) toproduce IMU data indicative of a motion of the VINS along thetrajectory; and a processing unit comprising an estimator that processesthe IMU data and the image data to compute respective state estimatesfor a position and orientation of the VINS for the first keyframe andfor the second keyframe, wherein the estimator processes the image dataassociated with the non-keyframes to compute one or more constraints tothe position and the orientation of the VINS for the second keyframerelative to the position and the orientation of the VINS for the firstkeyframe, wherein, when computing the one or more constraints, theprocessor is configured to treat a landmark observed within the imagedata for the first keyframe or the second keyframe as different from thesame landmark observed within the image data for the non-keyframes, andwherein the estimator applies the one or more constraints when computingthe state estimates of the VINS for the second keyframe.
 13. The systemof claim 12, wherein, to process the image data associated with thenon-keyframes to compute the one or more constraints to the position andthe orientation of the VINS for the second keyframe relative to theposition and the orientation of the VINS for the first keyframe, theestimator is configured to process both the IMU data and the image dataassociated with the non-keyframes to compute the one or more constraintsto the position and the orientation of the VINS for the second keyframerelative to the position and the orientation of the VINS for the firstkeyframe.
 14. The system of claim 12, wherein, to process the image dataassociated with the non-keyframes to compute the one or more constraintsto the position and the orientation of the VINS for the second keyframerelative to the position and the orientation of the VINS for the firstkeyframe, the estimator is configured to: process the image dataassociated with the non-keyframes to compute the one or more constraintsto the position and the orientation of the VINS for the second keyframerelative to the position and the orientation of the VINS for the firstkeyframe; and discard IMU data associated with the non-keyframes.
 15. Amethod comprising: producing, by an image source, image data for a firstkeyframe, one or more nonkeyframes, and a second keyframe along atrajectory of a vision-aided inertial navigation system (VINS), the oneor more non-keyframes located between the first keyframe and secondkeyframe along the trajectory; producing, by an inertial measurementunit (IMU), IMU data indicative of a motion of the VINS along thetrajectory; processing, by one or more processors, the IMU data and theimage data to compute respective state estimates for a position andorientation of the VINS for the first keyframe and for the secondkeyframe; processing, by the one or more processors, the image dataassociated with the non-keyframes to compute one or more constraints tothe position and the orientation of the VINS for the second keyframerelative to the position and the orientation of the VINS for the firstkeyframe, wherein, when computing the one or more constraints theprocessor is configured to treat a landmark observed within the imagedata for the first keyframe or the second keyframe as different from thesame landmark observed within the image data for the non-keyframes; andapplying, by one or more processors, the one or more constraints whencomputing the state estimates of the VINS for the second keyframe. 16.The method of claim 15, wherein processing the image data associatedwith the non-keyframes to compute the one or more constraints to theposition and the orientation of the VINS for the second keyframerelative to the position and the orientation of the VINS for the firstkeyframe comprises processing both the IMU data and the image dataassociated with the non-keyframes to compute the one or more constraintsto the position and the orientation of the VINS for the second keyframerelative to the position and the orientation of the VINS for the firstkeyframe.
 17. The method of claim 16, wherein processing the image dataassociated with the non-keyframes to compute the one or more constraintsto the position and the orientation of the VINS for the second keyframerelative to the position and the orientation of the VINS for the firstkeyframe comprises: processing the image data associated with thenon-keyframes to compute the one or more constraints to the position andthe orientation of the VINS for the second keyframe relative to theposition and the orientation of the VINS for the first keyframe; anddiscarding IMU data associated with the non-keyframes.